In this part, we construct the microbiome embedding, which is analogous to a word embedding in NLP in that it carries semantic information.
Dataset - 16S seq data
Train data for embedding vectors: AGP feces samples (2,453 x 20,778); all feces samples (8,770 x 114,288)
Build the co-occurrence matrix
Russell Rao : \[ \frac{a}{n} \]
Russell Rao weight : \[ \sum_{s=1}^{n}\frac{1}{\left| sort_{i} - sort_{j} \right|} \]
Jaccard : \[ \frac{a}{n-c} \]
Faith : \[ \frac{a+c/n}{n} \]
Abundance totalsum : \[ \frac{\sum_{s=1}^{n}\min(abc_{i}, abc_{j})\times \left(1-\left| abc_{i}-abc_{j} \right|\right)}{n} \]
Abundance percentile : \[ \frac{\sum_{s=1}^{n}\min(per_{i}, per_{j})\times \left(1-\left| per_{i}-per_{j} \right|\right)}{n} \]
Braycurtis totalsum : \[ \frac{2\times \sum_{s=1}^{n}\min(abc_{i}, abc_{j})}{\sum_{s=1}^{n}(abc_{i}+abc_{j})} \]
Braycurtis percentile : \[ \frac{2\times \sum_{s=1}^{n}\min(per_{i}, per_{j})}{\sum_{s=1}^{n}(per_{i}+per_{j})} \]
n: total number of samples; a: number of samples in which both OTUs are present; c: number of samples in which both OTUs are absent
sort: the abundances of the OTUs in a sample, sorted from high to low (the sorted values are returned)
abc: abundance after total sum normalization
per: abundance after percentile normalization
Russell Rao only considers whether a species appears in a sample. Russell Rao weight additionally takes abundance into account when both species appear in a sample: the closer the abundances of the two species, the higher the weight.
Jaccard is also based on presence or absence, but when computing the co-occurrence value of two species, the number of samples in which both species are absent is subtracted from the total number of samples. This captures the signal when two species appear in only a subset of samples but co-occur frequently within that subset.
Faith is likewise based on presence or absence, but because two species may be missing from some samples for technical reasons, it increases the co-occurrence count by \(\frac{c}{n}\).
Abundance totalsum is based on the relative abundance of species. Species with similar abundances in a sample are considered more closely related, especially when their abundances are high, so we use \(1-\left|abc_i-abc_j\right|\) to measure how similar the abundances of two species are and \(\min\left(abc_i,abc_j\right)\) to give more weight to similarity at higher abundance. Abundance percentile uses the same formula but replaces relative abundance with percentile-normalized values, aiming to make the co-occurrence values more robust.
Braycurtis totalsum is also computed from relative abundance and reflects the differences in the relative abundance of two species across samples: if the abundances of two species do not differ much across all samples, the two species receive a high co-occurrence value. Braycurtis percentile similarly replaces relative abundance with percentile-normalized values.
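The formulas above can be sketched for a single pair of OTUs as follows. This is a minimal illustration, not the code used to build the actual matrices: the function names and the toy count table are hypothetical, Faith is implemented exactly as written above (with \(c/n\)), and the Russell Rao weight and percentile variants are left out because the text does not fully specify the sorting and percentile normalization steps.

```python
from itertools import combinations


def total_sum_normalize(counts):
    """Scale each sample (row) so that its OTU counts sum to 1."""
    normalized = []
    for row in counts:
        total = sum(row)
        normalized.append([v / total if total else 0.0 for v in row])
    return normalized


def pair_scores(abc, i, j):
    """Co-occurrence scores for OTUs i and j, following the formulas above."""
    n = len(abc)
    xi = [row[i] for row in abc]
    xj = [row[j] for row in abc]
    a = sum(1 for u, v in zip(xi, xj) if u > 0 and v > 0)    # both present
    c = sum(1 for u, v in zip(xi, xj) if u == 0 and v == 0)  # both absent
    shared = sum(min(u, v) for u, v in zip(xi, xj))
    total = sum(u + v for u, v in zip(xi, xj))
    return {
        "russell_rao": a / n,
        "jaccard": a / (n - c) if n > c else 0.0,
        "faith": (a + c / n) / n,
        "abundance_totalsum": sum(
            min(u, v) * (1 - abs(u - v)) for u, v in zip(xi, xj)
        ) / n,
        "braycurtis_totalsum": 2 * shared / total if total else 0.0,
    }


def cooccurrence_matrix(abc, measure):
    """Symmetric OTU x OTU matrix for one measure (diagonal left at 0)."""
    m = len(abc[0])
    matrix = [[0.0] * m for _ in range(m)]
    for i, j in combinations(range(m), 2):
        matrix[i][j] = matrix[j][i] = pair_scores(abc, i, j)[measure]
    return matrix


# Toy example: 4 samples x 3 OTUs of raw counts (made-up numbers).
counts = [
    [10, 10, 0],
    [5, 0, 5],
    [0, 0, 20],
    [8, 2, 0],
]
abc = total_sum_normalize(counts)
scores = pair_scores(abc, 0, 1)
jaccard_matrix = cooccurrence_matrix(abc, "jaccard")
```

For OTUs 0 and 1 in the toy table, both are present in two samples and both are absent in one, so Russell Rao divides by all four samples while Jaccard divides by three. The pairwise loop is quadratic in the number of OTUs, so a vectorized implementation would probably be needed at the scale of the real data.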
GloVe model to train embedding for microbiome.
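For orientation, the sketch below shows the standard GloVe objective (a weighted least-squares fit of \(w_i \cdot \tilde{w}_j + b_i + \tilde{b}_j\) to \(\log X_{ij}\)) on a toy co-occurrence matrix. Everything specific here is an assumption for illustration: the toy matrix, the embedding size, `x_max=1.0` (chosen because measures such as Russell Rao and Jaccard are bounded by 1), the learning rate, and the use of plain SGD instead of the AdaGrad optimizer of the reference implementation.

```python
import math
import random


def glove_weight(x, x_max=1.0, alpha=0.75):
    """GloVe weighting function f(x); x_max and alpha are assumed values."""
    return (x / x_max) ** alpha if x < x_max else 1.0


def glove_loss(X, W, W_ctx, b, b_ctx, x_max=1.0, alpha=0.75):
    """Weighted least-squares GloVe objective over the non-zero entries of X."""
    loss = 0.0
    for i, row in enumerate(X):
        for j, x in enumerate(row):
            if x <= 0:
                continue  # log(0) is undefined, so empty cells are skipped
            dot = sum(wi * wj for wi, wj in zip(W[i], W_ctx[j]))
            err = dot + b[i] + b_ctx[j] - math.log(x)
            loss += glove_weight(x, x_max, alpha) * err * err
    return loss


def sgd_epoch(X, W, W_ctx, b, b_ctx, lr=0.05, x_max=1.0, alpha=0.75):
    """One pass of plain SGD over the non-zero cells (simplified optimizer)."""
    for i, row in enumerate(X):
        for j, x in enumerate(row):
            if x <= 0:
                continue
            dot = sum(wi * wj for wi, wj in zip(W[i], W_ctx[j]))
            err = dot + b[i] + b_ctx[j] - math.log(x)
            g = 2 * glove_weight(x, x_max, alpha) * err
            for k in range(len(W[i])):
                wi, wj = W[i][k], W_ctx[j][k]
                W[i][k] -= lr * g * wj
                W_ctx[j][k] -= lr * g * wi
            b[i] -= lr * g
            b_ctx[j] -= lr * g


# Toy symmetric co-occurrence matrix for 3 OTUs (made-up values in [0, 1]).
random.seed(0)
X = [[1.0, 0.6, 0.1],
     [0.6, 1.0, 0.0],
     [0.1, 0.0, 1.0]]
dim = 2
W = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in X]
W_ctx = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in X]
b = [0.0] * len(X)
b_ctx = [0.0] * len(X)

initial_loss = glove_loss(X, W, W_ctx, b, b_ctx)
loss_curve = []
for _ in range(200):
    sgd_epoch(X, W, W_ctx, b, b_ctx)
    loss_curve.append(glove_loss(X, W, W_ctx, b, b_ctx))
```

Plotting `loss_curve` against the epoch index gives the kind of loss curve used to monitor training. In GloVe the final embedding of item i is commonly taken as the sum of `W[i]` and `W_ctx[i]`; whether that convention was used here is not stated.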
Calculate the phylogenetic tree of the OTUs.
PCA: apply PCA to the phylogenetic distance matrix of the OTUs
glove: convert the phylogenetic distance matrix of the OTUs to a similarity matrix, then apply GloVe to calculate the phylogeny embedding: \[ 1-\frac{PD_{ij}}{\max(PD)} \] PD: phylogenetic distance matrix
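The distance-to-similarity conversion can be sketched in a few lines. The function name and the toy distance matrix are hypothetical, and the PCA variant is not shown.

```python
def phylo_similarity(pd_matrix):
    """Map phylogenetic distances to similarities via 1 - PD_ij / max(PD)."""
    pd_max = max(max(row) for row in pd_matrix)
    return [[1 - d / pd_max for d in row] for row in pd_matrix]


# Toy phylogenetic distance matrix for 3 OTUs (made-up values).
pd_matrix = [
    [0.0, 0.2, 0.8],
    [0.2, 0.0, 1.0],
    [0.8, 1.0, 0.0],
]
sim = phylo_similarity(pd_matrix)
```

Each OTU gets similarity 1 with itself, and the most distant pair gets similarity 0. Because the GloVe objective takes the logarithm of each entry, zero entries such as this one are normally skipped during training.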
Loss curve of the GloVe model
Five-fold cross-validation in AGP(Healthy-IBD) data
The pre-trained vectors were trained on all feces samples (8,770 x 114,288).
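Healthy and IBD samples were selected from the AGP dataset, and five-fold cross-validation was used to evaluate model performance. MLP and RF serve as baselines for comparison with our model. In MLP and RF, the pre-trained vectors reduce model performance and the raw abundance data performs better. In the Attention model, models based on pre-trained vectors perform better than models with randomly initialized vectors. The MLP model on OTUs performs best overall, and none of the Attention models outperform it.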
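A five-fold split of this kind can be sketched as below. Stratifying by class label is an assumption (the text only says five-fold cross-validation), and the function name, seed, and toy labels are hypothetical.

```python
import random


def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs with classes spread evenly over k folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)  # deal samples of each class round-robin
    for f in range(k):
        test_idx = sorted(folds[f])
        train_idx = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train_idx, test_idx


# Toy labels: 10 healthy and 5 IBD samples (made-up class sizes).
labels = ["healthy"] * 10 + ["IBD"] * 5
splits = list(stratified_kfold(labels, k=5))
```

Each sample appears in exactly one test fold, and every fold keeps roughly the same healthy-to-IBD ratio as the full set, which matters when the classes are imbalanced.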
Leave-one-out in IBD cross-dataset
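Performance evaluation of the cross-dataset classification models.
In RF, the pre-trained vectors reduce model performance, and the abundance-based model performs best. In MLP, 4 models with pre-trained vectors perform slightly better than the model based on raw abundance. In the Attention model, the top 2 models show a clear improvement over RF and MLP.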
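One plausible reading of the leave-one-out cross-dataset setup is that each dataset (study) is held out in turn as the test set while the model is trained on the remaining datasets. The sketch below illustrates that reading; the function name and the study identifiers are hypothetical.

```python
def leave_one_dataset_out(dataset_ids):
    """Yield (held_out, train_idx, test_idx), holding out one whole dataset per split."""
    for held_out in sorted(set(dataset_ids)):
        train_idx = [k for k, d in enumerate(dataset_ids) if d != held_out]
        test_idx = [k for k, d in enumerate(dataset_ids) if d == held_out]
        yield held_out, train_idx, test_idx


# Toy example: 6 samples drawn from 3 studies (made-up identifiers).
dataset_ids = ["study_A", "study_A", "study_B", "study_C", "study_C", "study_C"]
splits = list(leave_one_dataset_out(dataset_ids))
```

Because no sample from the held-out study is seen during training, this setup measures how well a model transfers across studies rather than how well it fits one cohort.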
Leave-one-out in CRC cross-dataset
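The PRJNA763872 dataset was removed from the CRC analysis. It previously performed very poorly: the model consistently predicted its healthy samples as CRC samples. The literature explicitly states that these samples are healthy, so it is unclear whether some other factor is responsible. The dataset is excluded here for now.
In RF, the pre-trained vectors reduce model performance, and the abundance-based model performs best. In MLP, 1 model with pre-trained vectors performs slightly better than the model based on raw abundance. In the Attention model, the top 7 models perform similarly to one another and show a clear improvement over RF and MLP.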
Leave-one-out in dietary fiber cross-dataset
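In some individuals, the gut microbiota does not respond to dietary fiber intake. To better study the changes in the gut microbiota before and after the dietary fiber intervention, we filter out individuals who are less sensitive to the intervention.
Filtering method: for each individual, compute the within-group and between-group sample distances (Bray-Curtis) before and after the intervention, then compute the ratio of the mean between-group distance to the mean within-group distance, and keep the individuals whose ratio is greater than 1.2 for subsequent analysis.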
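The filtering rule can be sketched as follows. This is an illustration under assumptions: "within-group" is taken to mean pairs of samples from the same time point group (both before or both after) and "between-group" pairs with one sample from each. The function names and toy profiles are hypothetical.

```python
from itertools import combinations, product


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two abundance profiles."""
    total = sum(u) + sum(v)
    if total == 0:
        return 0.0
    return sum(abs(a - b) for a, b in zip(u, v)) / total


def response_ratio(before, after):
    """Mean between-group distance divided by mean within-group distance."""
    within = [bray_curtis(u, v)
              for group in (before, after)
              for u, v in combinations(group, 2)]
    between = [bray_curtis(u, v) for u, v in product(before, after)]
    if not within or not between:
        return None  # too few samples to form both kinds of pairs
    mean_within = sum(within) / len(within)
    mean_between = sum(between) / len(between)
    return mean_between / mean_within if mean_within > 0 else float("inf")


def keep_responders(subjects, threshold=1.2):
    """Return the ids of individuals whose ratio exceeds the threshold."""
    kept = []
    for subject_id, (before, after) in subjects.items():
        ratio = response_ratio(before, after)
        if ratio is not None and ratio > threshold:
            kept.append(subject_id)
    return kept


# Toy relative-abundance profiles (made-up): two samples before, two after.
subjects = {
    "responder": (
        [[0.5, 0.5, 0.0], [0.6, 0.4, 0.0]],
        [[0.1, 0.2, 0.7], [0.1, 0.3, 0.6]],
    ),
    "non_responder": (
        [[0.5, 0.5, 0.0], [0.3, 0.7, 0.0]],
        [[0.4, 0.6, 0.0], [0.5, 0.4, 0.1]],
    ),
}
kept = keep_responders(subjects)
```

In the toy data the first individual shifts strongly after the intervention (ratio about 6.5) and is kept, while the second varies as much within a time point as across time points (ratio about 0.75) and is filtered out.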
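In RF, the pre-trained vectors reduce model performance, and the abundance-based model performs best. In MLP, performance differs little among the models. In the Attention model, the top 7 models show a clear improvement over RF and MLP.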
The impact of dataset size on the model.
Train data on AGP samples : 2,453 x 20,778
Train data on all feces samples : 8,770 x 114,288
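Vectors pre-trained on the AGP dataset are compared with vectors pre-trained on all feces samples to observe the effect of dataset size on model performance. On the AGP IBD dataset, models perform slightly better with the AGP pre-trained vectors. In the other cross-dataset comparisons, enlarging the dataset generally improves model performance. Abundance percentile performs best, with test-set performance improving to varying degrees as the dataset grows. Therefore, all subsequent analyses are based on the Abundance percentile formula.
All feces samples were subsampled to observe how the performance of the downstream Attention model changes as the size of the GloVe pre-training dataset increases. The pre-trained vectors use the Abundance percentile method.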
IBD
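The CLS vector is extracted from the Attention model. This vector represents a sample, and here we evaluate whether it captures the information relevant to our classification. In the t-SNE plot, the CLS vectors separate disease and healthy samples better. PVCA analysis also shows that the CLS vector increases the information content of the disease category and reduces batch effects between datasets. In the heatmap, the values of some dimensions are clearly associated with the sample groups.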
CRC
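In the t-SNE plot, the CLS vectors separate disease and healthy samples better. PVCA analysis also shows that the CLS vector increases the information content of the disease category and reduces batch effects between datasets. In the heatmap, the values of some dimensions are clearly associated with the sample groups.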
dietary fiber
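In the t-SNE plot, the CLS vectors still fail to better separate the dietary fiber intervention samples from the control samples. PVCA analysis also shows that the CLS vector clearly increases the dataset-related information content. In the heatmap, no dimensions whose values are clearly associated with the sample groups can be observed either.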
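SHAP values can be positive or negative. In binary classification, a positive value means the feature pushes the prediction toward the positive class, and a negative value means it pushes the prediction away from it. The larger the absolute value, the more important the feature is for classifying the sample. The results below are sorted by absolute SHAP value in descending order.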
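The ranking step can be sketched as follows. Aggregating per-sample SHAP values by their mean absolute value is a common convention and is assumed here, since the text only says the results are sorted by absolute value. The function name, feature names, and numbers are hypothetical, and computing the SHAP values themselves is not shown.

```python
def rank_by_mean_abs_shap(shap_values, feature_names):
    """Rank features by mean |SHAP| across samples, largest first."""
    n = len(shap_values)
    importance = {
        name: sum(abs(row[k]) for row in shap_values) / n
        for k, name in enumerate(feature_names)
    }
    return sorted(importance.items(), key=lambda kv: kv[1], reverse=True)


# Toy SHAP matrix: 3 samples x 3 features (made-up values).
feature_names = ["otu_1", "otu_2", "otu_3"]
shap_values = [
    [0.10, -0.40, 0.02],
    [-0.20, 0.30, 0.00],
    [0.05, -0.50, -0.01],
]
ranking = rank_by_mean_abs_shap(shap_values, feature_names)
```

Taking the absolute value before averaging keeps a feature with large positive and negative contributions from cancelling out to zero. The sign of the individual values is still needed to tell in which direction the feature pushes the prediction.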
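IBD: g__Oscillospira : multiple studies have shown that inflammation is closely associated with Oscillospira, in most cases negatively
s__gnavus : researchers found increased Ruminococcus gnavus (R. gnavus) in the feces of IBS-D patients
CRC: g__Peptostreptococcus : associated with the occurrence and development of CRC
g__Fusobacterium, s__anaerobius : reported to be associated with CRC
dietary_fiber: g__Bifidobacterium; s__longum shows a very strong signal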
Train data : 2,657 fids, 24,542 samples
Pre-training method used: Abundance percentile
OSCC
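The abundance-based MLP model performs best, and the Attention model with Abundance percentile is slightly worse. In the t-SNE plot, the CLS vectors separate disease and healthy samples better. PVCA analysis also shows that the CLS vector increases the information content of the disease category and reduces batch effects between datasets. In the heatmap, the values of some dimensions are clearly associated with the sample groups.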
SHAP : g__Fusobacterium: g__Streptococcus: